Abstract. Statistical methods have been widely employed to study the fundamental
properties of language. In recent years, methods from complex and dynamical systems
have proved useful for creating several language models. Despite the large number of
studies devoted to representing texts with physical models, only a limited number of
studies have shown how the properties of the underlying physical systems can be employed
to improve the performance of natural language processing tasks. In this paper, I
address this problem by devising complex network methods that are able to improve
the performance of current statistical methods. Using a fuzzy classification strategy, I
show that the topological properties extracted from texts complement the traditional
textual description. In several cases, hybrid approaches outperformed the results
obtained when only traditional or networked methods were used. Because the proposed
model is generic, the framework devised here could be straightforwardly used to study
similar textual applications where the topology plays a pivotal role in the description
of the interacting agents.
1. Introduction

Human language is a major factor responsible for the success of our species,
and its written form is one of the main means used to convey and share information.
Owing to the ubiquity of language in several contexts, many linguistic aspects have been
studied via the application of methods and tools borrowed from diverse scientific fields.
As a consequence, several findings related to the origins, organization and structure
of language have been unveiled. One of the most fundamental patterns arising
from statistical analysis of huge amounts of text is Zipf’s law, which states that
the frequency of a word is inversely proportional to its rank [1-3]. Another fundamental
recurrent pattern is Heaps’ law, which states that the vocabulary size grows slowly
with the number of tokens of the document [4-6]. More recently, concepts from Physics
have been applied to model several language features. For example, with regard to
long-scale properties, concepts from dynamical systems have successfully been employed to
compare the burstiness of the spatial distribution of words in documents [7,8]. In a
similar fashion, the spatial distribution of words in documents and analogous systems
has also been studied in terms of level statistics [9], entropy [10] and intermittency
measurements [11].

A well-known approach to study written texts is the word adjacency network
model [12-14], which considers short-scale textual properties to form the networks.
In this representation, relevant words conveying meaning are modeled as nodes
and adjacency relationships are used to establish links. Using this model, several
characteristics of texts and languages have been inferred from statistical analyses
performed on the structure of networks [15]. Language networks have been increasingly
employed to understand theoretical linguistic aspects, such as the origins of fundamental
properties [16] and the underlying mechanisms behind language acquisition in early
years [17]. In practical terms, networks have been applied in the context of
machine translation [18], automatic summarization [19], sense disambiguation [20,21],
complexity/quality analysis [22,23] and document classification [24]. Despite the relative
success of applying network concepts to better understand language phenomena, in
many real-world applications attributes extracted from networked models have not
contributed to the advancement of the state of the art. For example, in
the document classification task, a strong dependency of network features on textual
characteristics has been observed. However, the best performance is still achieved
with traditional statistical natural language processing features. In this context, the
present study addresses this problem by devising methods that effectively take advantage
of network properties to boost the performance of the textual classification task. Here
I focus on text classification based on stylistic features, where a text is classified
according to stylistic marks left by specific authors [25] or literary genres [26]. Upon
introducing a hybrid classifier relying on the fuzzy definition of supervised pattern
recognition methods [27], I show that the performance of style-based classifications
can be significantly improved when topological information is included in traditional
models. Given the generality of the proposed method, the framework devised here
could be applied to improve the characterization and description of many related textual
applications.

2. Representing texts as networks 

There are several ways to map texts into networks [12,15,19,28-30]. The most suitable
form depends on the context of the application. When semantics is
relevant, words sharing some semantic relationship are linked [20,22,31-33]. In
a similar fashion, other semantic-based models link the words appearing in a given
context [34,35] (e.g. in the same sentence or paragraph). A general model for
establishing significant “semantic” links between co-occurring elements was devised
in [36]. More specifically, the model considers the existence of a set of elements
$V = \{v_1, \ldots, v_n\}$, which may occur in one or more sets of a given collection
$\mathcal{S} = \{S_1, \ldots, S_n\}$. Given two elements $\alpha \in V$ and $\beta \in V$, the model
computes the probability $p_t$ that more than $r$ sets in $\mathcal{S}$ contain both elements
$\alpha$ and $\beta$ as

$$p_t = \sum_{j > r} p(j), \tag{1}$$

where $p(j)$ is the probability that $\alpha$ and $\beta$ co-occur exactly $j$ times in the same set.
Given a confidence level $p_0$, the strength of the link between $\alpha$ and $\beta$ is $s = \log(p_0 / p_t)$.

In applications where the style (or structure) plays an important role, the links
among words are established according to syntactical relationships [19,30]. A well-known
approach for grasping stylistic features of texts is the word adjacency model [13,37,38],
which basically connects adjacent words in the text. Unlike the model devised
in [36], the word adjacency model captures the stylistic features of texts [39]. It has
been shown that the adjacency model is able to capture most of the syntactical links
with the benefit of being language independent. Despite being a simplification of the
syntactical analysis, the adjacency model has been employed in several contexts because
the topological properties of word adjacency and syntactical networks are similar. Such a
high degree of similarity can be explained by the fact that most of the syntactical
links occur among neighbouring words [30]. In the current paper, the traditional word
adjacency representation was adopted.

Before mapping the text into a word adjacency network, some pre-processing steps
are usually applied. First, words conveying low semantic content, such as articles and
prepositions, are removed from the text. These words, referred to as stopwords, are
disregarded because they just serve to link content words. As a consequence, they can be
straightforwardly replaced by network edges. The remaining words are then lemmatized,
i.e. they are transformed to their canonical forms. To assist the lemmatization process,
all words are labeled with their part-of-speech tags. Particularly, in this study, I used
a model based on maximum entropy [40]. To exemplify the construction of a word
adjacency network, I show in table 1 the pre-processing steps performed on a short text.
The corresponding network is shown in figure 1.
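
The pre-processing and mapping steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the stopword list and the lemma table are hypothetical, hand-picked resources that cover just the example poem (the study itself relies on part-of-speech tagging for lemmatization), and the links are treated as undirected with self-loops discarded.

```python
from collections import Counter

# Hypothetical, hand-picked resources that cover only the example poem.
STOPWORDS = {"in", "the", "of", "there", "was", "a", "should", "i",
             "this", "my", "that"}
LEMMAS = {"fatigued": "fatigue", "retinas": "retina"}


def preprocess(text):
    """Lowercase, strip punctuation, drop stopwords and lemmatize."""
    words = (w.strip(".,/").lower() for w in text.split())
    return [LEMMAS.get(w, w) for w in words if w and w not in STOPWORDS]


def adjacency_edges(tokens):
    """Count undirected links between adjacent, distinct tokens."""
    edges = Counter()
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # ignore self-loops such as "stone stone"
            edges[frozenset((a, b))] += 1
    return edges


poem = ("In the middle of the road there was a stone / there was a stone "
        "in the middle of the road there was a stone in the middle of the "
        "road there was a stone. Never should I forget this event / in the "
        "lifetime of my fatigued retinas / Never should I forget that in "
        "the middle of the road / there was a stone / there was a stone "
        "in the middle of the road / in the middle of the road there was "
        "a stone.")

tokens = preprocess(poem)          # 27 content words, as in table 1
edges = adjacency_edges(tokens)    # keys are unordered word pairs
# Degree of each word: number of distinct neighbours in the network.
degree = Counter(word for edge in edges for word in edge)
```

The edge counts could serve as weights if a weighted representation is preferred; here only the presence of a link matters.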

3. Topological characterization of networks 

A network can be defined as $G = \{V, E\}$, where $V$ denotes the set of nodes and $E$
denotes the set of edges, which serve to link nodes. An unweighted network can be
represented by an adjacency matrix $A = \{a_{ij}\}$, where each element $a_{ij}$ stores the
information concerning the connectivity of nodes $i$ and $j$. If $i$ and $j$ are connected, then
$a_{ij} = 1$. Otherwise, $a_{ij} = 0$. Note that, in an undirected network devoid of self-loops,
$A^T = A$. Currently, there are several network measurements available to characterize
the topology of complex networks [41]. Here, I describe the main measurements applied
for text analysis:

• Degrees: the degree ($k$) is the number of edges connected to the node, i.e.
$k_i = \sum_j a_{ij}$. Relevant features related to the degree that have been useful for
Table 1. Pre-processing steps applied to the poem “In the Middle of the Road”,
by Carlos Drummond de Andrade. First, stopwords are removed. Then, the
remaining words are mapped to their canonical form via lemmatization. Note
that the transformations fatigued → fatigue and retinas → retina occur during the
lemmatization step. The network obtained from the poem is illustrated in figure 1.

| Processing step | Outcome |
| --- | --- |
| Original text | In the middle of the road there was a stone / there was a stone in the middle of the road there was a stone in the middle of the road there was a stone. Never should I forget this event / in the lifetime of my fatigued retinas / Never should I forget that in the middle of the road / there was a stone / there was a stone in the middle of the road / in the middle of the road there was a stone. |
| Stopwords removal | middle road stone stone middle road stone middle road stone never forget event lifetime fatigued retinas never forget middle road stone stone middle road middle road stone |
| Lemmatization | middle road stone stone middle road stone middle road stone never forget event lifetime fatigue retina never forget middle road stone stone middle road middle road stone |


Figure 1. Example of word adjacency network created from the poem “In the Middle
of the Road”, by Carlos Drummond de Andrade. The pre-processing steps performed
to generate this word adjacency network are shown in table 1.


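text analysis are the average degree of the neighbors and the standard deviation of the
degree of the neighbors, which are given by

$$k_i^{(n)} = \frac{1}{k_i} \sum_j a_{ij} k_j, \tag{2}$$

$$\Delta k_i^{(n)} = \left[ \frac{1}{k_i} \sum_j a_{ij} \left( k_j - k_i^{(n)} \right)^2 \right]^{1/2}. \tag{3}$$

In text networks, both $k^{(n)}$ and $\Delta k^{(n)}$ have been useful to quantify the structural
organization of texts [42].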

• Accessibility: the accessibility measurement is an extension of the node degree
centrality [43]. It is defined as a normalization of the diversity measurement [43],
which quantifies the irregularity of accessing neighbors through self-avoiding
random walks [43]. To define the accessibility, let $P^{(h)}(i,j)$ be the probability
of a random walker starting at node $i$ to reach node $j$ in exactly $h$ steps.
The heterogeneity of access to neighbors can be quantified with the diversity
measurement:

$$\delta_i^{(h)} = -\sum_j P^{(h)}(i,j) \log P^{(h)}(i,j). \tag{4}$$

Given eq. 4, the accessibility ($\alpha$) is computed as

$$\alpha_i^{(h)} = \exp\left(\delta_i^{(h)}\right). \tag{5}$$

It can be shown that the accessibility is bounded according to the relation
$0 < \alpha_i^{(h)} \le n_h$, where $n_h$ is the number of nodes at the $h$-th concentric level [44]. An
example of the computation of the accessibility in a small network is shown in
figure 2. In the example, nodes 2, 3, 4 and 5 belong to the first concentric level and
nodes 6, 7, 8, 9 and 10 belong to the second concentric level. When one considers a
regular access to the second level (red configuration), the accessibility corresponds
to the total number of nodes located at the $h$-th concentric level. When some
nodes are more accessed than others, the accessibility decreases because fewer nodes
are effectively accessed.

In textual networks, it has been shown that the accessibility is more advantageous
than other traditional centrality measurements as it is able to capture more
information at further hierarchical levels [44]. This measurement has been
successfully applied to detect core concepts in texts [19]. Furthermore, it has
also been employed to generate informative summaries [19]. A dependency of the
distribution of this measurement on stylistic features of texts was observed in [45].


• Betweenness: the betweenness is a centrality measurement that considers that a
node is relevant if it is highly accessed via shortest paths. If $n_{sit}$ is the number of
geodesic paths between $s$ and $t$ passing through node $i$, and $n_{st}$ is the total number
of shortest paths linking $s$ and $t$, then the betweenness is defined as:

$$B_i = \sum_{s} \sum_{t} \frac{n_{sit}}{n_{st}}. \tag{6}$$

In word adjacency networks, highly frequent words usually take high values of
betweenness. However, some words may act as articulation points whenever
they link two semantic contexts or communities [46]. It has been shown that
Figure 2. Probabilities of transition from node 1 considering $h = 2$ steps for two
distinct configurations of links. The first configuration considers only the red edges and
the second one considers both red and blue edges. Note that, in the first configuration,
the probability to reach any node at the second level is the same. In this case,
$\alpha_1^{(2)} = 5$. When blue edges are included, nodes 7 and 9 tend to receive more visits
than the other nodes, according to the considered probabilities. For this reason, the
effective number of accessed nodes drops to $\alpha_1^{(2)} = 4.71$.


the betweenness is able to identify the generality of contexts in which a word 
appears [47]. More specifically, domain-specific words tend to assume lower values 
of betweenness when compared with more generic words. 

• Assortativity: several real-world networks are formed of nodes with a specific 
type of classification. For example, in social networks, individuals may be classified 
by considering their age, sex or race. When analyzing the connectivity patterns 
of networks, it might be relevant to study how distinct classes connect to each 
other. This type of analysis is usually performed with the so-called assortativity 
measurement [48]. The assortativity can be computed as the Pearson correlation 
coefficient (r): 

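$$r = \frac{e^{-1} \sum_{j>i} k_i k_j a_{ij} - \left[ e^{-1} \sum_{j>i} \frac{1}{2} (k_i + k_j) a_{ij} \right]^2}{e^{-1} \sum_{j>i} \frac{1}{2} (k_i^2 + k_j^2) a_{ij} - \left[ e^{-1} \sum_{j>i} \frac{1}{2} (k_i + k_j) a_{ij} \right]^2}, \tag{7}$$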

where e is the total number of edges. In word adjacency networks, the assortativity 
quantifies how words with distinct frequency appear as neighbors. 
• Clustering coefficient: the clustering coefficient ($C$) quantifies the local density
of neighbors of a given node. The local definition of the clustering coefficient is
given by the fraction of the number of triangles among all possible connected sets
of three nodes:

$$C = 3 \sum_{k>j>i} a_{ij} a_{ik} a_{jk} \left[ \sum_{k>j>i} \left( a_{ij} a_{ik} + a_{ji} a_{jk} + a_{ki} a_{kj} \right) \right]^{-1}. \tag{8}$$


Similarly to the betweenness, the clustering coefficient is useful to detect words 
appearing in generic contexts [47]. However, differently from the betweenness, the 
clustering coefficient analyzes only the local neighborhood of nodes. 

• Average shortest path length: the average shortest path length ($l$) is the typical
distance between any two nodes in the network. This measurement was used in this
paper because it has been useful in stylistic-based applications [47]. In texts, the
average shortest path length quantifies the relevance of words. More specifically, according
to this measurement, the most relevant words are those that are close to the hubs.
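
Among the measurements listed above, the accessibility is probably the least familiar, so a small Python sketch may help. It enumerates self-avoiding walks of exactly $h$ steps, accumulates the probability of ending at each node, and exponentiates the resulting entropy. The function names and the toy networks are assumptions introduced for illustration, and the exhaustive enumeration is only practical for small networks and small $h$.

```python
import math


def walk_probabilities(adj, start, h):
    """Probability of ending at each node after h self-avoiding steps.

    `adj` maps each node to the set of its neighbours. A walk that gets
    stuck before h steps contributes no probability mass.
    """
    probs = {}

    def extend(node, visited, p, steps):
        if steps == h:
            probs[node] = probs.get(node, 0.0) + p
            return
        options = [n for n in adj[node] if n not in visited]
        for n in options:  # uniform choice among unvisited neighbours
            extend(n, visited | {n}, p / len(options), steps + 1)

    extend(start, frozenset([start]), 1.0, 0)
    return probs


def accessibility(adj, start, h):
    """Exponential of the entropy of the h-step arrival probabilities."""
    probs = walk_probabilities(adj, start, h)
    diversity = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return math.exp(diversity)


# Toy network: node 0 reaches four second-level nodes with equal probability.
regular = {0: {1, 2}, 1: {0, 3, 4}, 2: {0, 5, 6},
           3: {1}, 4: {1}, 5: {2}, 6: {2}}
# Same network plus the link 2-3, which makes node 3 easier to reach.
biased = {0: {1, 2}, 1: {0, 3, 4}, 2: {0, 3, 5, 6},
          3: {1, 2}, 4: {1}, 5: {2}, 6: {2}}

acc_regular = accessibility(regular, 0, 2)
acc_biased = accessibility(biased, 0, 2)
```

In the regular toy network the accessibility equals the number of nodes at the second level, while the extra link lowers it, mirroring the behaviour described for figure 2.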


Most of the measurements described in this section are local measurements, i.e.
each node $i$ possesses a value $X_i$, where $X = \{k, k^{(n)}, \Delta k^{(n)}, \alpha, B, C, l\}$. For the
purposes of this paper, it is necessary to summarize the local measurements. The most
natural choice is to characterize the documents by computing the average $\langle X \rangle$, where
$\langle \ldots \rangle = N^{-1} \sum_{i=1}^{N} \ldots$ stands for the average computed over the $N$ distinct words of the
text. A disadvantage associated with this type of summarization is that all words
receive the same weight, regardless of their number of occurrences in the text. To avoid
this potential problem, I also computed the average value $\langle \ldots \rangle^* = \eta^{-1} \sum_{i=1}^{\eta} \ldots$ obtained
when only the $\eta = 50$ most frequent words are considered. The standard deviation $\Delta X$
and the skewness $\gamma(X)$ were also used to characterize the documents.
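
The summarization step can be illustrated with the following Python sketch, which turns a set of local values into the four document-level attributes mentioned above. The function name and the toy values are hypothetical, and two choices are assumptions rather than details stated in the text: the standard deviation is computed as a population deviation, and the skewness as the standardized third central moment.

```python
import math


def summarize(values, frequencies, eta=50):
    """Summarize a local measurement over the words of a document.

    `values` maps each distinct word to its local measurement and
    `frequencies` maps each word to its number of occurrences.
    """
    xs = list(values.values())
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    # Assumed definition: standardized third central moment.
    skew = (sum((x - mean) ** 3 for x in xs) / n) / std ** 3 if std > 0 else 0.0
    # Average restricted to the eta most frequent words.
    top = sorted(values, key=lambda w: frequencies[w], reverse=True)[:eta]
    mean_top = sum(values[w] for w in top) / len(top)
    return {"mean": mean, "mean_top": mean_top, "std": std, "skewness": skew}


# Toy data: the two most frequent words have the smallest values.
toy_values = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 10.0}
toy_freqs = {"a": 5, "b": 4, "c": 1, "d": 1}
summary = summarize(toy_values, toy_freqs, eta=2)
```

In the toy data the two averages differ markedly, which is the situation that motivates keeping both attributes.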

A drawback associated with the computation of local topological measurements in
word adjacency networks is the high correlation found between these measurements and
the node degree (i.e. the word frequency). To minimize this correlation, the following
procedure was adopted. Each of the measurements was normalized by the average
obtained over 30 texts produced using a word shuffling technique, where the frequencies
of words are preserved. If $\mu(X^{(R)})$ and $\sigma(X^{(R)})$ are the average and the standard
deviation obtained over the random realizations, the normalized measurement $\tilde{X}$ and
the error $\epsilon(\tilde{X})$ related to $\tilde{X}$ are

$$\tilde{X} = \frac{X}{\mu(X^{(R)})}, \tag{9}$$

$$\epsilon(\tilde{X}) = \frac{\sigma(X^{(R)})}{\mu(X^{(R)})^2} X = \frac{\sigma(X^{(R)})}{\mu(X^{(R)})} \tilde{X}. \tag{10}$$
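
A minimal Python sketch of this normalization is given below. It shuffles the token sequence (which preserves word frequencies), evaluates a user-supplied measurement on each shuffled text, and divides the observed value by the average over the realizations. The function names are hypothetical, the population standard deviation is an assumption, and the error term is written under the assumption of first-order error propagation.

```python
import random
import statistics


def count_edges(tokens):
    """Example measurement: number of distinct undirected adjacency links."""
    return len({frozenset(p) for p in zip(tokens, tokens[1:]) if p[0] != p[1]})


def normalize_by_shuffling(tokens, measure, realizations=30, seed=0):
    """Return (normalized value, error) of `measure` relative to shuffled texts.

    Raises ZeroDivisionError if the measurement averages to zero over the
    shuffled texts.
    """
    rng = random.Random(seed)
    observed = measure(tokens)
    null_values = []
    for _ in range(realizations):
        shuffled = list(tokens)
        rng.shuffle(shuffled)  # word frequencies are preserved
        null_values.append(measure(shuffled))
    mu = statistics.fmean(null_values)
    sigma = statistics.pstdev(null_values)
    normalized = observed / mu
    error = normalized * sigma / mu  # assumed first-order propagation
    return normalized, error


text = ("middle road stone stone middle road stone middle road stone "
        "never forget event lifetime fatigue retina never forget").split()
edges_norm, edges_err = normalize_by_shuffling(text, count_edges)
# A measurement that depends only on frequencies is unchanged by shuffling.
vocab_norm, vocab_err = normalize_by_shuffling(text, lambda t: len(set(t)))
```

The second call shows the intended effect: any quantity fully determined by word frequencies is mapped to 1 with zero error, so only structure beyond frequency survives the normalization.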







4. Traditional stylistic features 


Frequency of words and characters 

Traditional methods usually perform statistical analysis using specific textual
features [49]. An important contribution to stylometry was introduced by Mosteller
and Wallace [50], who showed that the frequency of function words (such as any, of, a
and on) is useful to characterize the style of texts. Frequent words have also been used
in strategies devised by physicists, where the distance between frequency ranks was used to
compute the similarity between texts [51-53]. Another relevant feature for characterizing
styles in texts is the frequency of character bigrams (i.e. sequences of two adjacent
characters). These attributes have proven useful, e.g., to detect the stylistic marks of
specific authors [54].
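
Both kinds of frequency attributes are simple to extract. The Python sketch below is one possible way of doing so; the function names are hypothetical, frequencies are expressed as relative frequencies, and whitespace is treated as an ordinary character when forming bigrams, which are assumptions rather than details stated in the text.

```python
from collections import Counter


def relative_word_frequencies(tokens, vocabulary):
    """Relative frequency of each word of `vocabulary` among `tokens`.

    `tokens` is assumed to be non-empty.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {w: counts[w] / total for w in vocabulary}


def character_bigram_frequencies(text):
    """Relative frequency of each pair of adjacent characters in `text`."""
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(bigrams.values())
    return {bg: c / total for bg, c in bigrams.items()}


word_freqs = relative_word_frequencies(
    "the cat sat on the mat".split(), ["the", "on", "of"])
bigram_freqs = character_bigram_frequencies("abab")
```

Each document is then described by a fixed-length vector once the vocabulary of function words (or the set of bigrams) is fixed across the whole dataset.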


Intermittence 


The uneven spatial distribution of words along texts is a feature useful for characterizing
the style of texts [10]. The homogeneity of the distribution of words
along texts can be quantified using recurrence times, a standard measure employed
to study time series [55]. In texts, the concept of time is represented in terms of the
number of words occurring in a given interval. For each word $i$, the recurrence time
$T_j$ is defined as the number of words appearing between two successive occurrences of
$i$ plus one. For example, the recurrence times of the word “stone” in the lemmatized
text shown in table 1 are $T_1 = 1$, $T_2 = 3$, $T_3 = 3$, $T_4 = 11$, $T_5 = 1$ and $T_6 = 5$. If
a word $i$ occurs $N_i$ times in a text comprising $N_T$ words, it generates a sequence of
$N_i - 1$ recurrence times $\{T_1, T_2, \ldots, T_{N_i - 1}\}$. In order to consider the time $T_f$ until the
first occurrence of $i$ and the time $T_l$, the number of words between the last occurrence
of $i$ and the last word of the text, the recurrence time $T_{N_i} = T_f + T_l$ is added to the
set of recurrence times of word $i$. As such, $\langle T \rangle = N_T / N_i$, where $\langle \cdot \rangle$ is the average over
distinct $T_j$'s. Note that the average recurrence time $\langle T \rangle$ does not provide additional
information, since it only depends on the frequency. The intermittence (or burstiness)
of the word $i$ is obtained from the coefficient of variation of the recurrence times:

$$I_i = \frac{\sigma_T}{\langle T \rangle} = \left[ \frac{\langle T^2 \rangle}{\langle T \rangle^2} - 1 \right]^{1/2}. \tag{11}$$


The intermittence has proven useful to identify core concepts in
texts, even when a large corpus is not available. Relevant words usually take high values
of intermittency, i.e. $I \gg 1$. Stopwords, on the other hand, are evenly distributed along
the text [45]. To illustrate the properties of the intermittency measurement, figure 3
shows the spatial distribution of the words long and Hobson in the book “Adventures of
Sally”, by P.G. Wodehouse. The frequencies of these words are similar, yet their values of
intermittency are quite different. Note that, while the distribution of long is relatively
homogeneous, Hobson is unevenly distributed along the text. Because bursty concepts
Figure 3. Profile of the spatial distribution of long ($N_i = 44$ and $I_i = 1.02$)
and Hobson ($N_i = 45$ and $I_i = 3.40$) along the word positions of the book “Adventures
of Sally”, by P. G. Wodehouse. Because Hobson is unevenly distributed along the book,
this word takes a high value of intermittency.


are the most relevant words [10], in the example of figure 3, Hobson would be considered 
a relevant character in the plot. 
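
The intermittence can be computed directly from word positions, as in the Python sketch below, which uses the lemmatized poem of table 1 as input. The function names are hypothetical. The time until the first occurrence is taken as the position of that occurrence counted from 1, an assumption chosen so that the recurrence times add up to the text length, as required by $\langle T \rangle = N_T / N_i$.

```python
import math


def recurrence_times(tokens, word):
    """Recurrence times of `word`, including the closing term T_f + T_l."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    if not positions:
        raise ValueError(f"{word!r} does not occur in the text")
    times = [b - a for a, b in zip(positions, positions[1:])]
    t_first = positions[0] + 1                # words up to the first occurrence
    t_last = len(tokens) - 1 - positions[-1]  # words after the last occurrence
    return times + [t_first + t_last]


def intermittence(tokens, word):
    """Coefficient of variation of the recurrence times of `word`."""
    times = recurrence_times(tokens, word)
    mean = sum(times) / len(times)
    mean_sq = sum(t * t for t in times) / len(times)
    # max() guards against tiny negative values caused by rounding.
    return math.sqrt(max(mean_sq / mean ** 2 - 1.0, 0.0))


lemmatized = ("middle road stone stone middle road stone middle road stone "
              "never forget event lifetime fatigue retina never forget "
              "middle road stone stone middle road middle road stone").split()

stone_times = recurrence_times(lemmatized, "stone")
stone_intermittence = intermittence(lemmatized, "stone")
```

For “stone” the first six values reproduce the recurrence times quoted in the text, and the closing term adds 3, so the seven values sum to the 27 words of the lemmatized poem.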

5. Supervised classification 

In this section, I describe the two methods employed to combine distinct strategies of
classification. The objective here is to combine evidence from both traditional and
complex network (CN) based methods. Before introducing the methods for combining
classifiers, the main concepts concerning the supervised learning task are presented.

In a typical supervised learning task, two datasets are employed: the training and
the test dataset. The training dataset $X_{tr} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$ is the set
of instances whose classes are known beforehand. The first component of each tuple,
$x_i = (f_1 = a_1, f_2 = a_2, \ldots)$, represents the values of the attributes used to describe the
$i$-th instance. The second component, $y_i \in \mathcal{Y} = \{y_1, y_2, \ldots\}$, represents the class label
of the $i$-th training instance. In the supervised learning task, the objective is to
obtain the map $x \mapsto y$. The quality of the map obtained is evaluated with the test
dataset $X_{ts} = \{(x_{l+1}, y_{l+1}), (x_{l+2}, y_{l+2}), \ldots, (x_{l+u}, y_{l+u})\}$. The technique used to evaluate
the performance of the classifiers used in this study is the well-known 10-fold
cross-validation [56].

A usual procedure in classification tasks is the quantification of the relevance of
each attribute for the task. To quantify the relevance of attributes to discriminate the
data, several indices have been proposed [56]. A well-known index, the information gain,
quantifies the homogeneity of the set of instances in $X_{tr}$ when the value of the attribute
is specified [57]. Whenever a single class prevails when the value of an attribute $f_k$ is
specified, the information gain associated with $f_k$ takes high values. Mathematically, the
information gain is defined as

$$\Omega(X_{tr}, f_k) = H(X_{tr}) - H(X_{tr} | f_k), \tag{12}$$

where $H(X_{tr})$ is the entropy of the training dataset $X_{tr}$ and $H(X_{tr} | f_k)$ is the entropy
of the training dataset when the $k$-th attribute is specified. The quantity $H(X_{tr} | f_k)$ is
computed as

$$H(X_{tr} | f_k) = \frac{1}{|X_{tr}|} \sum_{v \in V(f_k)} |\{x \in X_{tr} | f_k = v\}| \, H(\{x \in X_{tr} | f_k = v\}), \tag{13}$$

where $V(f_k)$ is the set comprising all values taken by $f_k$ in $X_{tr}$.
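
The information gain can be computed as in the Python sketch below, which follows the two definitions above for an attribute taking discrete values. The function names are hypothetical, entropies are measured in bits (base-2 logarithm) as an assumption, and continuous attributes such as frequencies would need to be discretized beforehand.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by specifying the attribute value."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    # Weighted entropy of the subsets sharing the same attribute value.
    conditional = sum(len(g) / len(labels) * entropy(g)
                      for g in groups.values())
    return entropy(labels) - conditional


labels = ["author_1", "author_1", "author_2", "author_2"]
gain_informative = information_gain(["low", "low", "high", "high"], labels)
gain_useless = information_gain(["low", "high", "low", "high"], labels)
```

The first attribute separates the two classes perfectly and recovers the full entropy of the labels, whereas the second one leaves the class distribution unchanged in every subset.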

Hybrid classifier 

To define this classifier, consider the following definitions. Let $m_{ij}$ be the strength
associating the $i$-th instance to the $j$-th class, where $0 \le m_{ij} \le 1$. The term $m_{ij}$,
henceforth referred to as membership strength, can be interpreted as the likelihood of
instance $i$ to belong to class $j$. Note that the quantification of $m_{ij}$ depends on the
classifier being used. A detailed description of methods for computing the membership
strength can be found in [27,58-61]. Let $m_{ij}^{(CN)}$ be the membership strength obtained
when topological features of complex networks are used and $m_{ij}^{(T)}$ the membership
strength obtained when traditional statistical features are analyzed. According to the
hybrid strategy, topological and traditional methods are combined
through the following convex combination

$$m_{ij}^{(H)} = \lambda m_{ij}^{(CN)} + (1 - \lambda) m_{ij}^{(T)}, \tag{14}$$

where $\lambda \in [0,1]$ accounts for the weight associated with the topological strategy. Note that
the combination of evidence performed in equation 14 yields a membership strength
$m_{ij}^{(H)}$ ranging in the interval [0,1]. The final decision is then made according to the rule

$$\tilde{y}_i = \arg\max_j m_{ij}^{(H)}, \tag{15}$$

where $\tilde{y}_i \in \mathcal{Y}$ denotes the class label assigned to the $i$-th test instance. In
practical applications, a scan over $\lambda$ can be useful to find its best value. However,
this process might be computationally unaffordable for very large datasets. In this case,
an adequate value of the parameter can be found via the application
of optimization heuristics [62,63].

To illustrate how the decision is performed with the hybrid algorithm, figure 4
shows the classification of a text modeled as a network. In the top panel, the central
network is the text whose class is to be inferred. The left and right networks represent
two networks in the training dataset. The node color denotes the node label, i.e. the
word associated with the node. In this example, two texts are semantically similar if they
share the same words. In the central panel, the scenario where the hybrid classifier is set
with $\lambda = 0.15$ is considered. Because $\lambda$ is close to zero, the classification is mostly based
on the number of shared words. In fact, the decision boundary in this case is created so
that the test instance is classified in the same class as the right training instance. The
bottom panel illustrates the decision made with $\lambda = 0.85$. Now, the topological features
of the text are the most prevalent features used for the classification. As a consequence,
the decision boundary moves to the right side so that the test instance is classified as
Figure 4. Example of classification based on the hybrid classifier. In the top panel,
the network $r_?$ (instance of the test dataset) might assume two possible classes: $c_1$
and $c_2$. An example for each of these classes is provided by the instances of the training
dataset (see networks $r_1$ and $r_2$). In the central panel, the decision
boundary obtained for $\lambda = 0.15$ is shown. Because $\lambda$ takes a low value in this case, the
decision is mainly based on the number of shared nodes (words). As a consequence, $r_?$
is classified as belonging to class $c_2$. For higher values of $\lambda$, the topological features of
texts take over. In the bottom panel ($\lambda = 0.85$), $r_?$ is classified as belonging to class
$c_1$ because $r_?$ and $r_1$ are topologically similar.


the same class as the left training instance. Note that the test instance is classified as
belonging to class $c_1$ because $r_1$ and $r_?$ are topologically similar.
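
The decision rule of the hybrid classifier reduces to a weighted average followed by an argmax, as the Python sketch below illustrates. The function names and the membership values are hypothetical; in practice the membership strengths would come from the underlying classifiers (e.g. class probabilities), which is outside the scope of this sketch.

```python
def hybrid_memberships(m_topological, m_traditional, lam):
    """Convex combination of two membership vectors for one instance."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(m_topological) != len(m_traditional):
        raise ValueError("membership vectors must have the same length")
    return [lam * mt + (1.0 - lam) * mr
            for mt, mr in zip(m_topological, m_traditional)]


def hybrid_predict(m_topological, m_traditional, lam):
    """Index of the class with the largest combined membership strength."""
    combined = hybrid_memberships(m_topological, m_traditional, lam)
    return max(range(len(combined)), key=combined.__getitem__)


# Hypothetical memberships: topology favours class 0, shared words class 1.
m_top = [0.8, 0.2]
m_trad = [0.3, 0.7]
low_lambda_class = hybrid_predict(m_top, m_trad, lam=0.15)
high_lambda_class = hybrid_predict(m_top, m_trad, lam=0.85)
```

With the low weight the traditional evidence prevails, and with the high weight the topological evidence prevails, which is the same qualitative switch illustrated in figure 4.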

Tiebreaker classifier 

Similarly to the hybrid classifier, this classification scheme uses as attributes both 
traditional and topological features of texts modelled as networks. The objective of this 
approach is to use the topological attributes only when the classification performed with 
Figure 5. Example of classification based on the tiebreaker classifier. In the left panel
(traditional approach), the gray instances are the test instances that should be classified
with the class labels red circle or blue asterisk. In this case, traditional attributes were
used. Note that the square is significantly far from the decision boundary (dashed line).
Therefore, the classification of this instance does not demand the use of topological
features because $\Delta > \theta$ in equation 16. In contrast, the pentagon is located on the
decision boundary. Because $\Delta < \theta$ in this case, topological attributes are used to perform
the classification. According to the topological attributes (see right panel, topological
approach), the test instance represented by the pentagon is classified as belonging to
the blue class.


traditional features is not reliable, as revealed by the values of membership strength.
Consider that the two most likely classes to which the unknown instance $i$ belongs are $j$
and $k$, according to traditional features. As a consequence, $m_{ij}^{(T)} \ge m_{ik}^{(T)} \ge m_{il}^{(T)}$, for
each class $l \in \mathcal{Y} \setminus \{j, k\}$. Let $\Delta$ be the difference between the membership strengths
associated with the two most likely classes, i.e. $\Delta = m_{ij}^{(T)} - m_{ik}^{(T)}$. If this difference
surpasses a given threshold $\theta$, the tiebreaker performs the classification by using only
traditional attributes. Otherwise, topological attributes are employed to infer the class
of the unknown instance. Equivalently,

$$\tilde{y}_i = \begin{cases} \arg\max_j m_{ij}^{(T)}, & \text{if } \Delta = m_{ij}^{(T)} - m_{ik}^{(T)} > \theta, \\ \arg\max \{ m_{ij}^{(CN)}, m_{ik}^{(CN)} \}, & \text{otherwise,} \end{cases} \tag{16}$$

where $m^{(T)}$ and $m^{(CN)}$ denote the membership strengths obtained with traditional and
topological attributes, respectively. Note that the threshold $\theta$ ultimately decides which
attributes (traditional or topological) are used to perform the classification. An example
of classification using this approach is shown in figure 5.
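
The tiebreaker rule can be sketched in Python as follows. The function name and the membership values are hypothetical, and the sketch assumes at least two classes. Note that, when the traditional decision is considered unreliable, the topological memberships only arbitrate between the two classes ranked highest by the traditional features.

```python
def tiebreaker_predict(m_traditional, m_topological, theta):
    """Index of the predicted class for one instance (at least two classes)."""
    ranked = sorted(range(len(m_traditional)),
                    key=lambda c: m_traditional[c], reverse=True)
    j, k = ranked[0], ranked[1]  # two most likely classes (traditional)
    delta = m_traditional[j] - m_traditional[k]
    if delta > theta:
        return j  # the traditional decision is considered reliable
    # Otherwise topology decides, restricted to the two candidates.
    return j if m_topological[j] >= m_topological[k] else k


confident = tiebreaker_predict([0.80, 0.15, 0.05], [0.3, 0.6, 0.1], theta=0.1)
undecided = tiebreaker_predict([0.48, 0.46, 0.06], [0.3, 0.6, 0.1], theta=0.1)
restricted = tiebreaker_predict([0.48, 0.46, 0.06], [0.2, 0.3, 0.5], theta=0.1)
```

The last call shows the restriction at work: class 2 has the largest topological membership, but it is not among the two candidates, so class 1 is returned.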


6. Results and discussion 

The effectiveness of the combination of traditional textual features and topological 
measurements of networks was evaluated in the context of two natural language 
processing tasks. Both tasks rely upon the characterization of stylistic marks of texts. 
The studied tasks were the authorship attribution and genre detection problems. The
classifiers chosen to compose the combining techniques were: k-nearest neighbors (kNN),
Support Vector Machines (SVM), Random Forest (RFO) and Multilayer Perceptron (MLP).
These classifiers were chosen because they usually yield good accuracy rates with default
parameters [64].

6.1. Authorship recognition 

In the authorship attribution task, the objective is to identify the authors of texts
whose authorship is unknown [65]. The dataset employed here comprises books written
by eight authors: Arthur Conan Doyle (ACD), Bram Stoker (BRS), Charles Dickens
(CHD), Thomas Hardy (THH), Pelham Grenville Wodehouse (PGW), Hector Hugh
Munro (HHM) and Herman Melville (HME). To show how the topological properties of
the networks modelling books can be useful to improve the characterization of authors’
styles in texts, the following combinations of attributes were considered:

• INT+CN: the intermittence of stopwords was considered along with the
topological measurements of the network modeling the text. As stopwords, I
considered all the words that appeared at least once in all books of the dataset.
In this dataset, there is a total of 340 stopwords.

• FR+CN: the frequency of stopwords was considered along with the topological
attributes of the networks.

• BG+CN: the frequency of character bigrams was considered along with the
topological attributes of the networks. All 634 character bigrams
were considered in this analysis.

For each combination of attributes, the gain in performance when the topology is
considered as an additional feature is represented as:

$$\Delta\Gamma(x) = \frac{\Gamma_H(x)}{\Gamma_T}, \tag{17}$$

where $\Gamma_H(x)$ denotes the accuracy rate obtained with either the hybrid or tiebreaker
classifier, $\Gamma_T$ denotes the accuracy rate obtained with traditional features alone, and $x$
is the corresponding parameter employed, i.e. $\lambda$ for the hybrid classifier
and $\theta$ for the tiebreaker classifier. Figure 6 shows the values of $\Delta\Gamma(\lambda)$ when the kNN
is used to compose the hybrid classifier. The results obtained for additional values of
the parameter $k$ are shown in table 2. Figure 6 also shows the maximum relative accuracy
rate obtained with the variation of $\lambda$:

$$\Delta\Gamma_{max} = \max_{\lambda \in [0,1]} \Delta\Gamma(\lambda). \tag{18}$$
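
The scan over $\lambda$ that yields the maximum relative gain can be sketched as below. This is only an illustration: `accuracy` stands for any user-supplied function returning the accuracy of the hybrid classifier for a given $\lambda$ (for instance estimated by cross-validation), the toy accuracy curve is invented for the example, and the accuracy at $\lambda = 0$ is taken as the traditional baseline, an assumption consistent with the table entries where a gain of 1.000 is reported at $\lambda = 0.00$.

```python
def best_relative_gain(accuracy, steps=100):
    """Scan lambda on a regular grid; return (maximum gain, best lambda)."""
    baseline = accuracy(0.0)  # assumed: lambda = 0 uses traditional features only
    grid = [i / steps for i in range(steps + 1)]
    gains = [(accuracy(lam) / baseline, lam) for lam in grid]
    return max(gains, key=lambda pair: pair[0])


def toy_accuracy(lam):
    """Invented accuracy curve peaking at an intermediate lambda."""
    return 0.6 + 0.8 * lam * (1.0 - lam)


max_gain, best_lambda = best_relative_gain(toy_accuracy)
```

A gain above 1 at an intermediate value of the parameter indicates that the combination outperforms the traditional features alone.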

With regard to the INT+CN combination, the addition of topological features of
texts yielded accuracy rates much higher than the ones obtained with intermittency
features alone, since $\Delta\Gamma_{max} > 2$ for $k = \{3, 4, 5\}$. Note that, in this case, the
network approach alone performed better than the method based solely on intermittency
features, because $\Delta\Gamma(\lambda = 1) > 1$. This observation might explain the high gain
obtained when networks are included as an additional feature. As for the FR+CN
combination, the highest gain in performance was $\Delta\Gamma_{max} = 1.197$, which was obtained
for $k = 5$. The lowest gain in accuracy occurred for the BG+CN combination. In
this case, the maximum gain was $\Delta\Gamma_{max} = 1.070$ for $k = 5$. When other classifiers
were used to compose the hybrid classifier, similar results were found (see table
3): the highest and lowest improvements in performance occurred for the INT+CN
and BG+CN combinations, respectively. In general, the hybrid classification that
combined traditional statistical features and topological measurements of networks
improved the classification performance in the authorship recognition task. Interestingly,
in several cases, the combination outperformed the accuracy rates obtained when the
component strategies were analyzed separately, i.e. $\Gamma_H(\lambda \in\, ]0,1[) > \Gamma_H(\lambda = 0)$ and
$\Gamma_H(\lambda \in\, ]0,1[) > \Gamma_H(\lambda = 1)$.


Table 2. Best relative accuracy rate ΔΓ_max obtained with the hybrid classifier based 
on the k-nearest neighbors method. The threshold λ* employed to obtain the best 
accuracy rate is also shown. Note that the accuracy improved on several occasions. 

                 ΔΓ_max  λ*      ΔΓ_max  λ*      ΔΓ_max  λ*      ΔΓ_max  λ*      ΔΓ_max  λ*
... 1.524 0.57 ... 0.53 2.026 ... 2.242 0.48
Stopwords        1.000   0.00    1.097   0.41    1.059   0.51    1.159   0.45    1.197   0.56
Characters       1.000   0.00    1.000   0.00    1.024   0.38    1.034   0.44    1.070   0.37


Table 3. Best relative accuracy rate ΔΓ_max obtained with the hybrid classifier 
based on the Support Vector Machine (SVM), Random Forest (RFO) and Multi 
Layer Perceptron (MLP) methods. The threshold λ* employed to obtain the best 
accuracy rate is also shown. Note that the accuracy improved on several occasions. 

                 SVM             RFO             MLP
                 ΔΓ_max  λ*      ΔΓ_max  λ*      ΔΓ_max  λ*
Intermittence    1.125   0.25    1.800   0.59    1.227   0.53
Stopwords        1.176   0.49    1.529   0.36    1.045   0.50
Characters       1.032   0.20    1.000   0.00    1.000   0.00


Figure 6. Relative accuracy rate as a function of the topological weight (λ) obtained 
with the k-nearest neighbors. The attributes employed were: (a)-(c): intermittence; 
(d)-(f): stopwords; (g)-(i): characters. The parameters k of the k-nearest neighbors 
were: k = 3 in (a), (d) and (g); k = 4 in (b), (e) and (h); and k = 5 in (c), (f) 
and (i). As it turns out, there is an improvement of the accuracy rates when traditional 
methods are combined with the technique based on the topological analysis of complex 
networks. 

Traditional and topological attributes of texts were also combined using the 
tiebreaker classifier. The results obtained with the kNN classifier for k = {3, 4, 5} 
are shown in figure 7. The results obtained for other values of the parameter k are 
displayed in table 4. The tiebreaker classifier built using the INT+CN combination 
provided an increase in performance of up to 31%. The maximum gains obtained with 
the combinations FR+CN and BG+CN were 34.2% and 4.8%, respectively. With regard 
to the other classifiers composing the tiebreaker technique, the results obtained are 
shown in table 5. Interestingly, there was no improvement when the SVM was used 
(ΔΓ_max = 1). In contrast, when the RFO classifier was used, an improvement of 
performance was observed in all three combinations of attributes. Finally, the tiebreaker 
classifier built with the MLP only provided a gain in accuracy for the combination 
INT+CN. While providing a gain in performance in several scenarios, the tiebreaker 
technique usually performed worse than the hybrid classifications performed with the 
same compound classifiers and parameters. 
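
The role of the threshold in the tiebreaker strategy can be illustrated with a short
Python sketch. The rule coded here is only one plausible reading and is an assumption
of this example: the decision of the traditional classifier is kept unless the
difference between its two highest class scores is smaller than θ, in which case the
decision is delegated to the network-based classifier. The function name and the toy
scores are hypothetical.

```python
def tiebreaker_predict(p_trad, p_net, theta):
    # Rank the traditional scores from highest to lowest (labels break exact ties).
    ranked = sorted(p_trad.items(), key=lambda item: (-item[1], item[0]))
    top_label, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    # Assumed rule: a margin below theta counts as a tie, resolved by the network scores.
    if top_score - runner_up < theta:
        return max(sorted(p_net), key=p_net.get)
    return top_label


# Toy scores for one test instance (illustrative values only).
p_trad = {"A": 0.51, "B": 0.49}
p_net = {"A": 0.20, "B": 0.80}
decision_small_theta = tiebreaker_predict(p_trad, p_net, 0.01)
decision_large_theta = tiebreaker_predict(p_trad, p_net, 0.05)
```

Under this reading, θ = 0 always keeps the traditional decision, and larger values of
θ hand an increasing share of the decisions to the topological attributes.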


Table 4. Best relative accuracy rate ΔΓ_max obtained with the tiebreaker method 
based on the k-nearest neighbors algorithm. The threshold θ* employed to obtain the 
best accuracy rate is also shown. Note that the accuracy improved on several occasions. 

0.47 

Characters 

1.000 
0.02 

1.026 0.01 

1.016 

0.25 
0.02 


Table 5. Best relative accuracy rate ΔΓ_max obtained with the tiebreaker algorithm 
applied to the Support Vector Machine (SVM), Random Forest (RFO) and Multi Layer 
Perceptron (MLP) methods. The threshold θ* employed to obtain the best accuracy 
rate is also shown. Note that the improvement of accuracy depends upon the pattern 
recognition method and attributes employed. 

                 SVM             RFO             MLP
                 ΔΓ_max  θ*      ΔΓ_max  θ*      ΔΓ_max  θ*
Intermittence    1.000   -       1.375   0.02    1.055   0.02
Stopwords        1.000   -       1.294   0.23    1.000   -
Characters       1.000   -       1.030   0.06    1.000   -


The results in tables 2-5 confirm that the topology of networks modeling texts is 
able to improve the characterization of texts because such textual description grasps 
relevant patterns that are mostly disregarded by traditional statistical approaches. 
This is supported by the optimized results obtained for λ* > 0 and θ* > 0. To 
better understand the factors behind the discriminability power of the network model, 
the relevance of each topological feature for discriminating authors was computed with 
the information gain criterion. The best topological features were the standard deviation 
of the accessibility at the third level (information gain of 1.07), the standard deviation 
of the average neighborhood degree (1.05) and the average clustering coefficient 
(0.60). An example of the discriminability provided by the topological strategy 
is shown in figure 8. 
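
To make the information gain criterion concrete, the following Python sketch computes
the gain of a single continuous feature as the reduction in class entropy obtained
after discretizing the feature. The equal-width binning, the number of bins and the
function names are assumptions of this example; the discretization actually used to
obtain the values reported here is not specified in this section.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, bins=2):
    # Class entropy minus the entropy that remains after binning the feature.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant feature: every value falls into the first bin
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        index = min(int((v - lo) / width), bins - 1)  # equal-width bins (assumed)
        groups[index].append(y)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy feature values for four texts by two authors (illustrative values only).
feature = [0.1, 0.2, 0.8, 0.9]
gain_separable = information_gain(feature, ["a", "a", "b", "b"])
gain_uninformative = information_gain(feature, ["a", "b", "a", "b"])
```

Because the entropy is measured in bits, a feature can reach a gain above 1 only when
more than two authors are involved, as is the case for the highest values reported
above.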

6.2. Style identification 

In this section, I investigate whether the topological features of complex networks are 
useful to complement the traditional textual characterization for the style identification 
task [66]. 

Figure 7. Relative accuracy rate as a function of the threshold (θ) obtained with the 
tiebreaker algorithm applied to the k-nearest neighbors. The attributes employed 
were: (a)-(c): intermittence; (d)-(f): stopwords; (g)-(i): characters. The parameters k 
of the k-nearest neighbors were: k = 3 in (a), (d) and (g); k = 4 in (b), (e) and (h); 
and k = 5 in (c), (f) and (i). As it turns out, there is an improvement of the accuracy 
rates when traditional methods are combined with the technique based on the topological 
analysis of complex networks. 

Figure 8. Discriminability of authors obtained with two topological features of 
complex networks modelling texts. Note that, using only two features, it was possible 
to separate e.g. Alger from Melville. According to the information gain criterion, the 
most relevant network features for the authorship identification task were the standard 
deviation of the accessibility computed at the third level (information gain of 1.07) and 
the standard deviation of the average neighborhood degree (1.05). 


The dataset used was the Brown corpus [67], which comprises documents that are 
classified either as informative prose (e.g. press reportage, popular folklore and scientific 
manuscripts) or imaginative prose (e.g. general fiction, romance and love stories). 
Unlike the authorship attribution task, the objective here is to cluster together 
documents written using the same style, regardless of their authorship. 

The results obtained for the style identification in the Brown corpus using the 
hybrid technique are shown in tables 6 and 7. Considering the INT+CN combination, 
the highest gains in performance were 29.0% and 25.8% for the kNN (k = 4) and 
SVM methods, respectively. Minor improvements in accuracy were observed for other 
combinations, which might be explained by the high discriminability rates observed for 
the traditional techniques based on the frequency of stopwords (Γ = 95.6%) and character 
bigrams (Γ = 94.05%). The results obtained with the tiebreaker technique were inferior 
to those obtained with the hybrid technique (results not shown). All in all, the results 
confirm that the inclusion of topological attributes might also improve the characterization 
of documents in the context of stylistic-based classification tasks. In this task, the most 
relevant topological features were the skewness and the standard deviation of the 
clustering coefficient (information gains of 0.172 and 0.157, respectively). A visualization 
of the discriminability provided by topological features is shown in figure 9. 

Table 6. Best relative accuracy rate ΔΓ_max obtained with the hybrid classifier based 
on the k-nearest neighbors method. The threshold λ* employed to obtain the best 
accuracy rate is also shown. 

Table 7. Best relative accuracy rate ΔΓ_max obtained with the tiebreaker algorithm 
applied to the Support Vector Machine (SVM), Random Forest (RFO) and Multi 
Layer Perceptron (MLP) methods. 

Figure 9. Projection of the Brown dataset using topological features of networks 
modeling texts. The linear discriminant analysis [68] was employed to generate the 
visualization. Note that the variability of the documents classified as imaginative prose 
is lower than the variability of style observed for informative documents. 


7. Conclusions 

In this study, I have shown that the styles of texts can be captured by measuring the 
topological properties of the corresponding network representation. More important 
than merely noting a dependency between the structure of networks and stylistic 
features of texts, this study devised techniques to combine traditional and topological 
features, which were found to be of paramount importance to enhance the quality of 
current classification strategies. Two traditional stylometry tasks were studied: the 
authorship attribution and genre identification problems. In both tasks, the addition of 
topological features provided an improvement in classification performance. The highest 
improvement occurred with the hybrid classifier, which uses a linear convex combination 
of features from distinct sources of textual evidence. The different nature of the quantities 
used to characterize texts suggests a complementary role in capturing distinct aspects of 
written texts. Because the adequate choice of parameters for the proposed technique is far 
from being a trivial task, I intend, as future work, to devise a method that automatically 
assigns values of λ and θ to each test instance. 